A novel Q-learning algorithm based on improved whale optimization algorithm for path planning

Q-learning is a classical reinforcement learning algorithm and one of the most important methods of mobile robot path planning without a prior environmental model. Nevertheless, Q-learning is too simple when initializing Q-table and wastes too much time in the exploration process, causing a slow convergence speed. This paper proposes a new Q-learning algorithm called the Paired Whale Optimization Q-learning Algorithm (PWOQLA) which includes four improvements. Firstly, to accelerate the convergence speed of Q-learning, a whale optimization algorithm is used to initialize the values of a Q-table. Before the exploration process, a Q-table which contains previous experience is learned to improve algorithm efficiency. Secondly, to improve the local exploitation capability of the whale optimization algorithm, a paired whale optimization algorithm is proposed in combination with a pairing strategy to speed up the search for prey. Thirdly, to improve the exploration efficiency of Q-learning and reduce the number of useless explorations, a new selective exploration strategy is introduced which considers the relationship between current position and target position. Fourthly, in order to balance the exploration and exploitation capabilities of Q-learning so that it focuses on exploration in the early stage and on exploitation in the later stage, a nonlinear function is designed which changes the value of ε in ε-greedy Q-learning dynamically based on the number of iterations. Comparing the performance of PWOQLA with other path planning algorithms, experimental results demonstrate that PWOQLA achieves a higher level of accuracy and a faster convergence speed than existing counterparts in mobile robot path planning. The code will be released at https://github.com/wanghanyu0526/improveQL.git.


Introduction
With technological developments and increasing demand, the scope for mobile robots is becoming more extensive, including applications in machine automation and in fields such as construction, the military, and agriculture [1]. The path planning ability of a mobile robot strategies, Shi et al. [25] used Q-learning for adaptive servo gain adjustment, and proposed a fuzzy-based method for tuning the learning rate in order to improve Q-learning performance. Li et al. [26] proposed a novel off-policy interleaved Q-learning algorithm by introducing a behavior control policy. As regards Q-table improvements, Wang and de Silva [27] proposed a new distributed Q-learning algorithm, which updates a Q-table with local rewards to reduce Q-learning spaces. Song et al. [28] applied a dynamic wave expansion neural network to specified initializations of Q-values in a Q-table, which improved the convergence efficiency of Q-learning. Konar et al. [29] reduced the number of repeated updates of a Q-table by assuming the distance from the current state to the next state and the target, thereby reducing the time complexity of the algorithm.
However, Q-learning with the above improvements still needs to calculate all possible action-states, and therefore it still has the disadvantage of slow convergence speed. To solve this problem, metaheuristic optimization algorithms have been applied to improve the initialization phase of Q-learning. This provides a better initial state for Q-learning, and reduces the amount of time required for calculation and subsequent convergence.
A number of metaheuristic optimization algorithms have been proposed. Kennedy and Eberhart [30] proposed Particle Swarm Optimization, which originates from the predatory behavior of birds and seeks an optimal solution through collaboration and information sharing between individuals in the population. Passino [31] proposed Bacterial Foraging Optimization, which is a bionic random search algorithm that imitates the behavior of E. coli swallowing food in the human intestine. Rashedi et al. [32] proposed the Gravitational Search Algorithm, which uses the gravitational force between particles in the population to guide the movement of each particle to find the optimal solution. Yang [33] proposed the Bat Algorithm, which is a heuristic algorithm simulating bats in nature. The Bat Algorithm is mainly focused on searching for prey and avoids obstacles by simulating ultrasound to find the global optimal solution. Mirjalili [34] proposed Moth-Flame Optimization, a swarm intelligence optimization algorithm inspired by natural laws to simulate the spiral flight path of moths based on their navigational mechanism during flight. Mirjalili et al. [35] proposed the Gray Wolf Optimizer, which is an optimized search method based on the social hierarchy of gray wolves and inspired by their predatory activities. Inspired by the precise navigation of birds over long-distance aerial paths, Zamani et al. [36] proposed a novel differential evolution algorithm named the Quantum-based Avian Navigation Optimizer Algorithm (QANA). Mohammad et al. [37] proposed an efficient binary version of the QANA named BQANA to solve the feature selection problem of high-dimensional medical datasets. To solve engineering optimization challenges, Zamani et al. [38] proposed the Starling Murmuration Optimizer, which is based on the behavior of starlings during their stunning murmurations. 
Mirjalili and Lewis [39] proposed the Whale Optimization Algorithm (WOA) to simulate the spiral hunting behavior of humpback whales.
WOA has the advantages of being conceptually simple, requiring no gradient information, and being easy to implement. Therefore, in this paper WOA is selected to optimize the initial Q-table. However, similar to other population-based heuristic algorithms, WOA still has the problems of slow convergence speed and low convergence accuracy. Mafarja et al. [40] combined WOA with Simulated Annealing (SA), and implemented feature selection by embedding SA into WOA. Mafarja et al. [41] used WOA with an improved stochastic process or WOA with crossover and mutation operators in feature selection. Kaveh and Ghazaan [42] improved the original formula of WOA to enhance the convergence speed and increase the level of accuracy of the original algorithm. Combining WOA with a local search strategy, Abdel-Basset et al. [43] proposed the Hybrid Whale Algorithm to solve the problem of shop scheduling. To balance the capabilities of exploration and exploitation in WOA, Kaur and Arora [44] introduced chaos theory into WOA and proposed the Chaotic Whale Optimization Algorithm to enhance the performance of the original WOA. Mohammad et al. [45] proposed an Enhanced Whale Optimization Algorithm (E-WOA) using a pooling mechanism and three improved search strategies named migrating, preferential selecting, and enriched encircling prey. E-WOA was applied to medical datasets to verify the effectiveness of the algorithm, especially to detect coronavirus disease 2019 (COVID-19). Mohammad et al. [46] proposed an efficient hybrid algorithm that combined WOA with an improved Moth-Flame Optimization algorithm to solve the optimal power flow problem in power systems.
However, path planning methods for mobile robots using ε-greedy Q-learning still have three defects. First, Q-learning initializes a Q-table to zero at the time of initialization, increasing the time to calculate and update the Q-table, and subsequently resulting in a slow convergence process. Second, the strategy of ε-greedy Q-learning selects the next state in the exploration process too randomly, which means too much time is wasted in the exploration process. Third, because the value of ε is fixed, ε-greedy Q-learning path planning cannot switch processes flexibly between exploration and exploitation under any circumstances.
In order to solve the above problems, this paper proposes a new path planning algorithm named the Paired Whale Optimization Q-learning Algorithm (PWOQLA), which is based on an improved WOA and an improved ε-greedy Q-learning. Firstly, in order to correct the shortcoming of slow convergence caused by Q-learning initialization, WOA, as a metaheuristic optimization algorithm, is chosen for Q-table initialization instead of simply setting the values of a Q-table to zero. In this way, a Q-table that contains previous experience is learned before the exploration process. Thus, in the subsequent Q-learning path planning, the calculation time is reduced and a path with fewer steps and smoother corners is obtained.
Secondly, based on the pairing behavior of whales, a Paired Whale Optimization Algorithm (PWOA) is proposed to accelerate the convergence speed of WOA. The main innovation of PWOA is to pair each whale when initializing the population. When one paired whale finds a prey position, the position of the other paired whale is updated to the same prey position. The result of this improvement is to accelerate the speed of the whale population approaching a local optimal solution. Compared with the original WOA, PWOA further improves the convergence speed of Q-learning when initializing the Q-table.
Thirdly, to improve the convergence efficiency of ε-greedy Q-learning, which uses a random exploration strategy, a novel selective exploration strategy (SES) is proposed based on the relationship between current agent position and target position. During each exploration, the agent judges the relationship between those two positions. Based on the judgment of the relationship, the agent will selectively explore two directions that are closer to the target position, instead of exploring four directions at random. SES reduces the number of useless explorations to achieve the purpose of accelerating the convergence speed of ε-greedy Q-learning.
Fourthly, in order to switch flexibly between the exploration process and the exploitation process, we propose a nonlinear function that changes the value of ε in ε-greedy Q-learning dynamically based on the number of iterations. In other words, the exploitation probability of ε-greedy Q-learning gradually increases as the number of iterations increases, whereas the exploration probability of the surrounding environment decreases. Therefore, by changing the value of ε dynamically, exploration and exploitation can be switched flexibly.
Finally, combining the above improvements to Q-learning and WOA, the result is the proposed PWOQLA. In PWOQLA, PWOA is applied to the Q-table initialization phase of the improved Q-learning. Compared with the original Q-learning algorithm, PWOQLA is more accurate and more efficient at robot path planning, and experimental results show that the proposed algorithm has a greater level of accuracy and faster convergence compared to several path planning algorithms with similar functions.
The rest of this paper is structured as follows. The second section introduces ε-greedy Q-learning and WOA. The third section introduces the working principles and steps of PWOQLA. The fourth section compares PWOQLA with similar algorithms and discusses the experimental results. The fifth section draws the conclusions.

Q-learning
Q-learning [17] is a model-free algorithm and one of the main reinforcement learning algorithms. In a Markov environment, Q-learning enables an intelligent system to learn from experienced action sequences and to select the best action. Q-learning learns through the Q-value function. In the Q-value function, the current state and the selected action determine the state transition probability and the next state, and the agent receives an immediate reward after executing the selected action. The strategy of Q-learning is to maximize the cumulative reward over the future. Q-learning is called model-free because it compares the expected values of actions without an environmental model, and this is the advantage of Q-learning.
In Q-learning, each state-action pair $(s, a)$ has a corresponding Q-value $Q(s, a)$. In the subsequent learning process, the next action is selected according to $Q(s, a)$. The Q-value is defined as the sum of the rewards obtained by performing the current action and then executing a certain strategy. The optimal Q-value is the sum of the rewards acquired by performing the current action and thereafter following the optimal strategy, and it is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{1}$$

In Eq (1), $s_t$ is the current state; $a_t$ is the action performed in state $s_t$; $r_{t+1}$ is the reinforcement signal received after $a_t$ is executed in $s_t$ and is also called the reward; $s_{t+1}$ is the next state; $\gamma$ is a discount factor ($0 \le \gamma < 1$); and $\alpha$ is a learning coefficient ($0 < \alpha < 1$).

Each agent learning process can be considered as starting from a random state and adopting an ε-greedy strategy or a Boltzmann distribution strategy to select the next actions. In this paper the ε-greedy strategy is used for decision making. For example, when ε is initialized to 0.9, there is a 90% probability that the agent will choose an action according to the optimal value in the Q-table, and a 10% probability that it will choose an action at random. The random selection strategy is adopted so that the agent can search all possible actions and update $Q(s, a)$ for each of them. The agent observes the new state after executing the selected action. The $Q(s, a)$ of the previous state and action is then updated using the return and the maximum Q-value of the new state. Based on the new state, the agent continues to choose actions until it reaches the end state.
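As an illustration of the update in Eq (1) and of the ε-greedy selection described above, the following minimal Python sketch may help. It assumes a Q-table stored as a list of rows, follows the convention of this paper in which ε is the probability of the greedy choice, and takes α = 0.2 and γ = 0.8 from the experimental settings reported later; the function names `select_action` and `q_update` are hypothetical and are not taken from the released code.

```python
import random


def select_action(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row.

    Convention used in the text: with probability `epsilon` the greedy
    (highest-Q) action is taken, otherwise a uniformly random action.
    """
    if rng.random() < epsilon:
        return max(range(len(q_row)), key=lambda a: q_row[a])
    return rng.randrange(len(q_row))


def q_update(q, s, a, reward, s_next, alpha=0.2, gamma=0.8):
    """Apply one tabular Q-learning update to q[s][a] in place."""
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Toy table (assumption for illustration): 2 states x 2 actions, all zeros.
q = [[0.0, 0.0], [0.0, 0.0]]
new_value = q_update(q, s=0, a=1, reward=100, s_next=1)
```

With an all-zero table, a single update with reward 100 moves the entry one fifth of the way toward the target, since α = 0.2.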

Whale Optimization Algorithm
The Whale Optimization Algorithm (WOA) is a heuristic optimization algorithm proposed by Mirjalili and Lewis [39]. The algorithm proceeds as follows. In the search space of the optimization problem, each humpback whale is a candidate solution, called a search whale. WOA uses a set of search whales to find the global optimum of the optimization problem. For a given problem, the search starts from random initial solutions and then updates the candidate solutions according to the optimization rules until the termination criterion is met. In essence, WOA simulates the behavior of humpback whales searching for and attacking prey.
Encircling prey. The humpback whale recognizes the location of its prey and surrounds it. Because the position of the optimal solution is unknown a priori, the target prey position is presumed to be the current optimal solution, and the other search whales update their positions relative to this "target prey". The mathematical model of this encircling behavior is as follows:

$$\vec{D} = \left| \vec{C} \cdot \vec{G}(t) - \vec{X}(t) \right| \tag{2}$$

$$\vec{X}(t+1) = \vec{G}(t) - \vec{A} \cdot \vec{D} \tag{3}$$

In the above, $\vec{X}(t)$ is the current position vector; $\vec{G}(t)$ is the current optimal solution position vector; $\vec{D}$ is the distance between the search whale and the target prey; $t$ is the current number of iterations; and $\vec{A}$ and $\vec{C}$ are coefficient vectors.
If there is a better optimal solution position vector, $\vec{G}(t)$ should be updated in the current iteration. The formulae for calculating $\vec{A}$ and $\vec{C}$ are as follows:

$$\vec{A} = 2\vec{a} \cdot \vec{r}_a - \vec{a} \tag{4}$$

$$\vec{C} = 2\vec{r}_c \tag{5}$$

In the above, $\vec{r}_a$ and $\vec{r}_c$ are random vectors in the range [0, 1], and $\vec{a}$ decreases linearly from 2 to 0 during the iteration.

Bubble-net strategy
Humpback whales move around their prey in a spiral path and simultaneously spit out bubbles to create traps. This is known as the bubble-net strategy for hunting prey. In the WOA model, the shrinking encircling mechanism is achieved by reducing $\vec{a}$ in Eq (4). The fluctuation range of $\vec{A}$ decreases as $\vec{a}$ decreases. According to Eq (4), $\vec{A}$ is a random value in the interval $[-a, a]$. When $\vec{A}$ is set to a random value in the interval [−1, 1], the new position of the search whale can lie anywhere between its previous position and the current optimal solution position. The new position of the search whale is calculated as follows:

$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{G}(t) \tag{6}$$

$$\vec{D}' = \left| \vec{G}(t) - \vec{X}(t) \right| \tag{7}$$

In the above, $\vec{D}'$ is the distance from the whale to the prey (the best solution obtained so far); $l$ is a random number in the range [−1, 1]; and $b$ is the constant of the logarithmic spiral shape and can be set to different values according to specific application scenarios. The humpback whale swims along a spiral path toward its prey while also shrinking its encirclement. To update the whale's predatory position, the mathematical model combining the two behaviors is as follows:

$$\vec{X}(t+1) = \begin{cases} \vec{G}(t) - \vec{A} \cdot \vec{D}, & p < 0.5 \\ \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{G}(t), & p \ge 0.5 \end{cases} \tag{8}$$

In Eq (8), the variable $p$ is a random number between 0 and 1.

Searching for prey. As well as using the bubble-net strategy in the exploitation process, humpback whales also need to search for prey randomly in the exploration process. The mathematical model of searching for prey is as follows:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_{rand}(t) - \vec{X}(t) \right| \tag{9}$$

$$\vec{X}(t+1) = \vec{X}_{rand}(t) - \vec{A} \cdot \vec{D} \tag{10}$$

In the above, $\vec{X}_{rand}(t)$ is the position vector of a search whale randomly selected from the population.
In order to ensure both exploration and convergence, when $|\vec{A}| \ge 1$, the randomly selected search whale becomes the reference point when the other whales update their positions. In the other case ($|\vec{A}| < 1$), the current optimal solution position plays the pivotal role in updating the other search whales.
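The three WOA moves described above (encircling, spiral bubble-net, and random search) can be combined into one position update per whale. The sketch below is a minimal, dependency-free illustration rather than the authors' implementation. It assumes positions are plain lists of floats, uses the 0.5 threshold on p from the standard WOA formulation, and tests |A| as the Euclidean norm of the per-dimension coefficients, which is one of several common choices. The name `woa_step` is hypothetical.

```python
import math
import random


def woa_step(x, best, population, a, b=1.0, rng=random):
    """Return the next position of one whale as a new list of floats.

    x: current position; best: best position found so far;
    population: list of all positions; a: scalar decreasing from 2 to 0.
    """
    p = rng.random()
    if p >= 0.5:
        # Spiral (bubble-net) move around the current best position.
        ell = rng.uniform(-1.0, 1.0)
        spiral = math.exp(b * ell) * math.cos(2.0 * math.pi * ell)
        return [abs(g - xi) * spiral + g for xi, g in zip(x, best)]
    # Per-dimension coefficients: A lies in [-a, a], C lies in [0, 2].
    A = [2.0 * a * rng.random() - a for _ in x]
    C = [2.0 * rng.random() for _ in x]
    if math.sqrt(sum(ai * ai for ai in A)) < 1.0:
        ref = best                    # exploitation: encircle the best whale
    else:
        ref = rng.choice(population)  # exploration: follow a random whale
    return [r - ai * abs(ci * r - xi) for xi, r, ai, ci in zip(x, ref, A, C)]


# A whale already sitting on the best position stays there once a reaches 0.
stay = woa_step([1.0, 2.0], [1.0, 2.0], [[1.0, 2.0]], a=0.0)
```

In a full run, `a` would typically be recomputed each iteration as `2 - 2 * t / t_max`, which matches the linear decrease from 2 to 0 stated in the text.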

Methods
The Paired-Whale Optimization Q-learning Algorithm (PWOQLA) is a path planning algorithm that uses an improved WOA to initialize the Q-value of an improved ε-greedy Q-learning. The aim of PWOQLA is to overcome the disadvantage of slow convergence in the original ε-greedy Q-learning. The first part of this section introduces the Whale Optimization Q-learning Algorithm (WOQLA), which combines the algorithms of the original Q-learning and original WOA. The second part introduces the process of improving WOA, and the third part introduces the process of improving ε-greedy Q-learning. The final part introduces the application of PWOQLA in path planning.

Whale Optimization Q-learning Algorithm
The Whale Optimization Q-learning Algorithm (WOQLA) is an algorithm for mobile robot path planning that combines the original WOA with Q-learning initialization. To overcome the shortcomings of Q-learning, such as slow convergence caused by initialization, WOQLA optimizes the initialization of the Q-table instead of simply setting the Q-values to zero.
In the initialization phase, WOQLA generates a whale population of size n in a 20 × 20 grid space and uses the Q-value calculation in Eq (1) to calculate the fitness value of each whale. The position with the highest Q-value represents the best whale position. WOA is then used to optimize the Q-value of each whale in the population. When the maximum number of iterations is reached, the initialization of the Q-table by WOA ends. The original ε-greedy Q-learning is then used for path planning, starting from the newly initialized Q-table and iteratively updating it with the Q-value calculation formula. After the iterations are completed, the final Q-table is obtained. From this Q-table, we can find the path with the largest Q-value, which represents the best path. In this way, a Q-table containing previous experience is learned through WOA before the Q-learning search begins, which reduces the subsequent calculation time and accelerates the convergence of Q-learning.

Paired-Whale Optimization Algorithm
WOA suffers from slow convergence and low accuracy. Inspired by research on humpback whales and the observation that they perform activities mostly in pairs [47][48][49][50], this paper proposes the Paired-Whale Optimization Algorithm (PWOA) based on the pairing behavior of humpback whales. Pairing behavior helps paired whale individuals to find food faster with the help of their peers, which accelerates the convergence speed of the original WOA. The main improvement of PWOA over the original WOA is that the algorithm finds a mate for each whale when initializing the whale population, so that each whale has another whale paired with it. Whales are paired two by two in the order in which the individuals are randomly generated. The original WOA is then executed. At each iteration, the fitness value of each whale is calculated. When whale G is found to be in the best position, the algorithm compares the fitness value of whale G with that of its paired whale. If the fitness value of whale G is larger, the position of the paired whale is updated to the position of whale G. If the fitness value of whale G is smaller, the position of whale G is updated to the position of the paired whale. In this way, in each cycle, the two whales of each pair adopt the better of their two positions at the same time and then conduct their own exploration or exploitation process. The current number of iterations in PWOA is $t_{PW}$. These improvements accelerate the speed at which the whale population approaches the local optimal solution, and thereby accelerate the convergence of the final algorithm. The pseudo-code of the PWOA algorithm is shown in Algorithm 1, and a flowchart is shown in
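The pairing step described in this section can be sketched in a few lines of Python. This is an illustrative reading of the text, not a transcription of Algorithm 1: it assumes that whales are paired consecutively in generation order, that a larger fitness value is better, and that a leftover whale in an odd-sized population stays unpaired. The helper names `make_pairs` and `sync_pairs` are hypothetical.

```python
def make_pairs(n):
    """Pair whale indices two by two in generation order: (0, 1), (2, 3), ...

    If n is odd, the last whale is left unpaired (an assumption of this sketch).
    """
    return [(i, i + 1) for i in range(0, n - 1, 2)]


def sync_pairs(positions, fitness, pairs):
    """Move the weaker whale of each pair onto its partner's position.

    `fitness` maps a position to a score where larger is better.
    The list `positions` is modified in place and also returned.
    """
    for i, j in pairs:
        if fitness(positions[i]) >= fitness(positions[j]):
            positions[j] = list(positions[i])
        else:
            positions[i] = list(positions[j])
    return positions


# Toy 1-D example: fitness is higher the closer a whale is to the origin.
whales = [[0.0], [3.0], [5.0], [1.0]]
synced = sync_pairs(whales, lambda p: -abs(p[0]), make_pairs(len(whales)))
```

After synchronization, each whale would continue with its ordinary WOA update, so the two partners start from the same point but can diverge again through their own random coefficients.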

Improved Q-learning
In order to simplify a real-world application scenario, a 20 × 20 grid space is used to model the real environment. We assume that each grid cell corresponds to a coordinate in real space. The value of each grid cell is mapped to the Q-table of ε-greedy Q-learning. When the reward is −1, the cell is an obstacle. When the reward is +1, the cell is free space. When the reward is 100, the cell is the target position. As shown in Fig 2, an agent has four possible actions at position $s_t$: action 1 goes up; action 2 goes right; action 3 goes down; and action 4 goes left.
In order to improve ε-greedy Q-learning, firstly, this paper proposes a novel selective exploration strategy (SES) based on the target position, with the aim of improving the convergence efficiency of the original ε-greedy Q-learning and reducing the number of useless explorations. During each exploration, the agent first judges the relationship between the current agent position $s_t(x_t, y_t)$ and the target position $s_g(x_g, y_g)$. The agent then explores only the two directions that are closer to the target position. If $x_t \le x_g$ and $y_t < y_g$, then $a = \mathrm{rand}(1, 2)$; if $x_t < x_g$ and $y_t \ge y_g$, then $a = \mathrm{rand}(2, 3)$. If $x_t > x_g$ and $y_t \le y_g$, then $a = \mathrm{rand}(3, 4)$; if $x_t \ge x_g$ and $y_t > y_g$, then $a = \mathrm{rand}(1, 4)$.
Secondly, in order to switch flexibly between the exploration and exploitation processes, the ε value in ε-greedy Q-learning is changed dynamically. The value of ε is calculated by Eq (11), in which $u$ is the value of ε at the beginning of the iteration, and $v$ is the incremental range of ε. The sum of $u$ and $v$ is set to 1, and $u$ and $v$ are temporarily set to 0.6 and 0.4 respectively. The variables $a$ and $b$ are the coefficient parameters of the dynamic curve, and their values are determined by the number of iterations. The variable $t$ is the current number of iterations. The change curve of ε is shown in Fig 3. It can be deduced from Eq (11) that the changing trend of ε is decreasing.

PLOS ONE
Therefore, as the number of iterations increases, the exploitation probability of ε-greedy Q-learning is gradually increased, and the probability of exploring the surrounding environment is reduced. Consequently, the early stage attends to exploration capability whereas the later stage attends to exploitation capability.
To summarize these two improvements, firstly, we reduce the number of useless explorations by referring to the relationship between the target and the current position through the SES strategy; and secondly, the proportion of exploration and exploitation in different periods is planned reasonably with the ε curve. These two improvements help to reduce the amount of calculation in the algorithm and save computing resources.
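The idea behind SES, restricting exploration to the two actions that head toward the target, can be sketched as follows. This is a hedged illustration and not the exact case table of the paper: it assumes that x grows to the right and y grows upward, and it breaks ties on an axis toward RIGHT and UP, which is an arbitrary choice made for this sketch. The action numbers follow the text (1 up, 2 right, 3 down, 4 left); the function names are hypothetical.

```python
import random

UP, RIGHT, DOWN, LEFT = 1, 2, 3, 4  # action numbering from the text


def ses_candidates(pos, goal):
    """Return the two actions that point toward the goal's quadrant.

    Assumes x grows to the right and y grows upward. When the agent is
    aligned with the goal on one axis, the tie is broken toward RIGHT
    or UP (an assumption of this sketch).
    """
    (x, y), (xg, yg) = pos, goal
    horizontal = RIGHT if x <= xg else LEFT
    vertical = UP if y <= yg else DOWN
    return (vertical, horizontal)


def ses_explore(pos, goal, rng=random):
    """Pick one of the two goal-directed actions uniformly at random."""
    return rng.choice(ses_candidates(pos, goal))


toward_upper_right = ses_candidates((2, 3), (10, 10))
toward_lower_left = ses_candidates((12, 3), (10, 1))
```

Compared with drawing uniformly from all four actions, this halves the candidate set at every exploratory step, which is the source of the reduction in useless explorations claimed for SES.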

Paired-Whale Optimization Q-learning Algorithm
The Paired-Whale Optimization Q-learning Algorithm (PWOQLA) is a path planning algorithm that uses PWOA to initialize the Q-table in the improved Q-learning, instead of simply initializing the Q-table to zero. During initialization, a whale population of size n is generated and paired in a 20 × 20 grid space. The Q-value calculation in Eq (1) is used to calculate the fitness value of each whale. The position with the highest Q-value represents the best whale position. PWOA is then used to optimize the Q-value of the whale population. After the number of iterations reaches the maximum, Q-table initialization is completed. Subsequent path planning is performed using the improved ε-greedy Q-learning based on the newly obtained Q-table. The current number of iterations in PWOQLA is $t_{PWQ}$, and the final result represents the best path. The pseudo-code of PWOQLA is shown in Algorithm 2, and a flowchart of the algorithm is shown in

Experiment
In this section, we simulate the performance of PWOQLA in mobile robot path planning in a 20 × 20 grid environment with obstacles. The original Q-learning, Improved Decentralized Q-learning (IDQ) [51], the A* algorithm [52], and WOQLA are compared with PWOQLA to verify the effectiveness of PWOQLA in path planning. The A* algorithm is one of the basic algorithms for path planning. IDQ is one of the classical algorithms for improving Q-learning and is often used as a comparison algorithm.

Experimental environment and parameters
The experiment is performed in a 20 × 20 grid. The number of simulations is set to 30, as this number of samples is generally considered sufficient to measure the performance of an algorithm [30,33,39]. The side length of each grid cell is one standard unit, with no fixed physical length defined here. As shown in Fig 2, a mobile robot has four motion directions: action 1 moves forward, action 2 moves right, action 3 moves backward, and action 4 moves left. The reward and punishment rules are the same in Q-learning, IDQ, WOQLA, and PWOQLA. If the mobile robot encounters an obstacle, the penalty Q-value is reduced by 1; if the mobile robot moves to free space, the reward Q-value is increased by 1; and if the mobile robot moves to find the target position, the reward Q-value is increased to 100. Finally, a 400 × 4 Q-table containing all the information on 400 positions and 4 actions at each position is established.
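The environment described above maps naturally onto a small amount of code. The following sketch shows one possible layout of the 400 × 4 Q-table and of the reward rule. The row-major indexing of grid cells and the names `state_index` and `reward` are assumptions made for illustration; the text does not specify how cells are ordered in the table.

```python
SIZE = 20        # the grid is SIZE x SIZE
N_ACTIONS = 4    # forward (up), right, backward (down), left


def state_index(row, col, size=SIZE):
    """Map a grid cell to its row in the Q-table (row-major order, assumed)."""
    return row * size + col


def reward(cell, obstacles, goal):
    """Reward rule from the text: -1 for an obstacle, 100 for the target, +1 otherwise."""
    if cell in obstacles:
        return -1
    if cell == goal:
        return 100
    return 1


# Zero-initialized table, i.e. the baseline that WOQLA and PWOQLA replace.
q_table = [[0.0] * N_ACTIONS for _ in range(SIZE * SIZE)]
```

Here the zero-filled `q_table` corresponds to the original Q-learning initialization, which the whale-based methods replace with optimized starting values.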
In Q-learning, α is the learning coefficient, and reflects the degree to which the previous Q-value is retained. Following the work of Khriji et al. [53], α is set to 0.2 in this study. The variable γ is the discount factor, which represents how the agent treats future rewards when it receives the current reward. In this study, γ is set to 0.8 [53]. The number of iterations $t$ in Q-learning is set to 500. In WOA, $\vec{r}_a$ and $\vec{r}_c$ are random vectors in the range [0, 1], and $a$ decreases linearly from 2 to 0 during the iteration. The parameter $b$ is a constant defining the logarithmic spiral shape, which is set to 1 in this study [54]; and $l$ is a random number in the range [−1, 1]. The variable $p$ is a random number between 0 and 1. The population size n of WOA is set to 30, and the numbers of iterations $t_{PW}$ and $t_{PWQ}$ are both set to 500. The parameter settings are shown in Table 1.
The experiment was conducted on an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz platform. The graphics card was an Intel(R) HD Graphics 620, and the simulation software was MATLAB R2020a.

Experimental results and analysis of PWOA
By solving eight classic benchmark functions used in optimization problems, the efficiency of PWOA is compared with that of the original WOA. In this experiment, the population size is 30 for both WOA and PWOA, and the maximum number of iterations is set to 500. The benchmark functions used in the experiment include three types: unimodal (F1, F2, F3); multimodal (F4, F5, F6); and fixed-dimension multimodal (F7, F8) [55]. Table 2 summarizes the details of the benchmark functions in these experiments, including the cost function, the number of design variables $V_{no}$, the range of optimization variables, and the optimal value $f_{min}$.
The algorithms WOA and PWOA are each run 30 times on every benchmark function, starting from different randomly generated populations. The statistical results are reported in Table 3, which includes the average and standard deviation over the test cases. Table 3 shows that in F1, F2, F3, F4, F6, F7 and F8, both WOA and PWOA have reached the target optimal value, but the average value and standard deviation of the optimal fitness obtained by PWOA are both superior to those of WOA. This demonstrates that PWOA has greater convergence accuracy and algorithm stability compared with WOA. In F5, however, the standard deviation of the optimal fitness of PWOA is higher than that of WOA, indicating that the stability of PWOA is slightly worse. On the other hand, both WOA and PWOA eventually converge to the optimal value, and the average value of the optimal fitness of PWOA is better, which also reflects the advantages of PWOA.
It can be seen from Fig 5 that in F3, F5, F7 and F8, where the difference between WOA and PWOA in the final convergence accuracy is not large, PWOA finds the optimal solution in fewer iterations, indicating that PWOA has an excellent convergence speed compared to the original WOA. In F1, F2, F4 and F6, when the number of iterations is the same, the convergence accuracy of PWOA is higher than that of the original WOA.
Because the improved strategy of PWOA is to pair agents in the population in advance, the paired agents can share the information found in each iteration and help each other to move towards a better agent position. This strategy is equivalent to one agent exploring the search space twice in one iteration, which is more efficient. With the same parameters, PWOA is naturally superior to the original WOA. In general, PWOA improves the exploitation capability

Experimental results and analysis of PWOQLA
In this experiment, the performance of PWOQLA is compared with that of the original Q-learning, IDQ, the A* algorithm, and WOQLA. We compare the averages and standard deviations of the running time, the number of path steps, and the number of rotation angles over 30 simulations, and analyze features, advantages, and disadvantages. The running time simply and directly reflects the efficiency of the algorithm: the shorter the time, the smaller the complexity of the algorithm, and the higher the efficiency. Because the map is a square grid with the same distance between adjacent nodes, comparing the number of path steps is equivalent to comparing the path length. Finding the shortest path is one of the ultimate goals of a path planning algorithm. Finally, the number of rotation angles is calculated to obtain the steering situation of the mobile robot under actual conditions. If the number of rotation angles is small, the path is smoother and the mobile robot traveling along it does not need frequent changes of direction, which means that the path planning algorithm is more efficient. Figs 6-15 show the best paths obtained respectively by the A* algorithm, Q-learning, IDQ, WOQLA and PWOQLA in 6 experiments composed of different types of grid maps. In these figures, the blue dot represents the starting point, the red dot represents the destination, the grey squares represent the obstacles, the white squares represent free space, and the green line represents the final path. Table 4 shows a comparison of the time to calculate the target position in the maps. Table 5 compares the number of path steps, and Table 6 compares the number of rotation angles.
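Two of the metrics above, the number of path steps and the number of rotation angles, are simple to compute from a planned path. The sketch below assumes that a path is a list of (x, y) grid cells and interprets the "number of rotation angles" as the number of times the direction of travel changes between consecutive moves; that interpretation and the function names are assumptions of this example.

```python
def path_steps(path):
    """Number of moves in a path given as a list of (x, y) cells."""
    return max(len(path) - 1, 0)


def count_turns(path):
    """Count direction changes between consecutive moves along the path."""
    moves = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(path, path[1:])]
    return sum(1 for m1, m2 in zip(moves, moves[1:]) if m1 != m2)


# Example path: two moves right, two moves up, one move right.
example_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2)]
steps = path_steps(example_path)
turns = count_turns(example_path)
```

For the example path, the robot makes five moves and changes direction twice, once when it starts moving up and once when it resumes moving right.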
It can be seen from Table 4 that although the calculation time of the A* algorithm is the shortest, the cost function f(n) of the A* algorithm only considers the target position and does not consider the obstacles in the map. This leads to the A* algorithm having the largest average and the largest standard deviation of the number of rotation angles, as can be seen from Table 6, which shows that the path smoothness of the A* algorithm is the worst. Table 4 also shows that, for the original Q-learning, when the number of obstacles increases, the calculation time gradually decreases. This is because the exploration space shrinks as the number of obstacles increases, which saves time. In contrast, for IDQ, the calculation time gradually increases. This is because a greater number of local optimal solutions may be generated with the increase in the number of obstacles, which wastes time in this case. The calculation time for WOQLA and PWOQLA also increases with additional obstacles. When there are more obstacles in the map, the calculation time increases because early initialization may entrap Q-learning in a local optimal solution. When there are fewer obstacles in the map, the possibility of being trapped in a local optimal solution is reduced.
In general, PWOQLA achieves the shortest operation time among the algorithms except for the A* algorithm, followed by WOQLA and the original Q-learning, while IDQ has the longest operation time. According to Table 5, the average step values of the original Q-learning and IDQ are slightly higher than those of the other algorithms, while the average step value of PWOQLA is the lowest. These results show that in a disordered and irregular obstacle map, PWOQLA has the greatest path planning efficiency because it first optimizes the initialization of the Q-table, which simplifies the subsequent search and accelerates convergence.
Experiment 2: Lattice obstacles. Experiment 2 sets out to verify the path planning ability of PWOQLA on a regular lattice map. Fig 11 shows the optimal path results. Table 7 compares the time to find the target position, the number of path steps, and the number of rotation angles for each algorithm in the experiment. Lattice obstacles mainly test the smoothness of the planned path. Fig 11 and Table 7 show that the path planned by PWOQLA is the smoothest, with the shortest computational time and the lowest step value. Compared with the A* algorithm, PWOQLA improves the average number of rotation angles by 31.6%; compared with the original Q-learning, it improves the average by 31.0%. The standard deviation of the number of rotation angles is also the smallest for PWOQLA, indicating that PWOQLA is more stable and consistently finds the target with fewer rotations.
Experiment 3: Strip obstacles. Experiment 3 studies the path planning of each algorithm when the initial position and target position are separated by multiple long-strip obstacles and the target cannot be reached directly. Fig 12 shows the optimal path results. Table 8 shows the experimental results for each algorithm.
Because the terrain is complex and the path to the destination must pass no fewer than three obstacles, the original Q-learning has the longest calculation time. The calculation time for PWOQLA is still the shortest. The average numbers of rotation angles for IDQ, WOQLA and PWOQLA are all lower than that of the original Q-learning, indicating that these algorithms improve the smoothness of the planned path. Their standard deviations in the number of rotation angles are also lower, indicating that their paths are more stable.

PLOS ONE
Experiment 4: Horizontal obstacles. Experiment 4 sets out to test the ability of these algorithms to pass through straight and narrow paths. Table 9 shows that the A* algorithm, Q-learning and IDQ do not find the shortest route in every simulation, whereas WOQLA and PWOQLA perform well in this respect. Although WOQLA and PWOQLA have no advantage in computing time, their number of rotation angles is significantly reduced. Compared with the original Q-learning, the number of rotation angles is reduced by 25.6% for WOQLA and by 32.4% for PWOQLA. Additionally, Fig 13 shows that WOQLA and PWOQLA take smoother paths with fewer turns to reach the destination. When applied to the path planning of mobile robots, these algorithms reduce the time the robot spends rotating and changing direction, and save resources.
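The percentage reductions quoted above can be read as relative changes in the mean number of rotation angles. As a sketch, and assuming (the text does not state the formula) that each reduction is taken relative to the original Q-learning baseline, with \bar{R}_{QL} and \bar{R}_{alg} denoting the mean number of rotation angles over the 30 simulations (symbols introduced here for illustration only):

```latex
\[
\text{reduction}_{\mathrm{alg}}
  = \frac{\bar{R}_{\mathrm{QL}} - \bar{R}_{\mathrm{alg}}}{\bar{R}_{\mathrm{QL}}} \times 100\%
\]
```

Under this reading, a 25.6% reduction for WOQLA corresponds to \bar{R}_{WOQLA} / \bar{R}_{QL} = 0.744, and a 32.4% reduction for PWOQLA corresponds to \bar{R}_{PWOQLA} / \bar{R}_{QL} = 0.676.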
Experiment 5: Room type obstacles. Experiment 5 simulates finding the target position in a room and tests the path planning ability of the algorithms when there are room-type obstacles. Fig 14 shows that the path found by PWOQLA is visibly the best. Table 10 indicates that, except for the A* algorithm, PWOQLA has the shortest average operation time and is the most suitable for this scenario, while every algorithm stably finds the path with the least number of steps every time.
Experiment 6: Concave obstacles. The aim of Experiment 6 is to simulate finding the target position in a narrow concave tunnel. Combining Fig 15 and Table 11, it can be seen that the average numbers of rotation angles of the original Q-learning, IDQ, WOQLA and PWOQLA are almost the same, and the standard deviations do not differ much, indicating that these algorithms are suited to this kind of scene. However, compared with the other scenarios, the best path with the least number of steps cannot be found every time. In the Experiment 6 scenario, the A* algorithm and the original Q-learning take detours to find the target position in some simulations, whereas IDQ, WOQLA and PWOQLA are more stable and find the path with the least number of steps more often.
Wilcoxon rank-sum test. The Wilcoxon rank-sum test is a nonparametric hypothesis test used to infer whether the distributions of two populations differ in location. Here it is applied to the experimental results of each algorithm over 30 independent runs. The p-values were evaluated at the 5% significance level (95% confidence level): a p-value less than 0.05 indicates a significant difference between the experimental data of the two algorithms being compared. The corresponding results for computational time, path steps and number of rotation angles are reported in Table 12.
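The decision rule above can be illustrated with a small standard-library sketch of the two-sided rank-sum test. It is not the authors' code. It assumes the normal approximation to the rank-sum statistic, applies no tie correction to the variance, and uses synthetic running times that are not taken from the paper; the function names `average_ranks` and `rank_sum_test` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). Assumption: no tie correction is applied to the variance.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))    # equals 2 * (1 - Phi(|z|))
    return z, p


if __name__ == "__main__":
    # Synthetic running times in seconds, for illustration only.
    times_a = [0.80 + 0.01 * i for i in range(10)]
    times_b = [1.20 + 0.01 * i for i in range(10)]
    z, p = rank_sum_test(times_a, times_b)
    print("significant at the 5% level:", p < 0.05)
```

With 30 runs per algorithm, as in these experiments, the normal approximation is generally considered reasonable; for very small samples an exact test would be preferable.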
According to Table 12, under the different experimental map environments, the distribution of computational time for PWOQLA differs significantly from that of the other path planning algorithms, which indicates that PWOQLA significantly improves the path planning time and supports its superiority in this respect. For some of the data on path steps and number of rotation angles, there is no significant difference. This is, first, due to the limitation of data types and, second, because the other algorithms already perform well on these two indicators.

Discussion
In general, except for the A* algorithm, PWOQLA has the best performance in these experiments, as shown by the shortest calculation time and the smoothest path. The reason why PWOQLA has the best performance is that it uses PWOA to improve the initialization

larger number of steps and a larger number of rotation angles. Considering the actual requirements for mobile robot path planning, PWOQLA is clearly better. In Experiment 4, the calculation time of PWOQLA is longest in five simulations. PWOQLA does not perform well in Experiment 4 because its exploitation ability has not been improved. When the obstacle area is relatively large and concentrated, the disadvantages of PWOQLA are more obvious. PWOQLA focuses more on improving the exploration strategy, which mainly improves the calculation time of the algorithm. In Experiment 3, however, the path planned by PWOQLA is the smoothest, and both the average and the standard deviation of the number of rotation angles are the smallest. This shows that the pre-treatment during PWOQLA initialization compensates for the limited exploitation ability to a certain extent, improving knowledge of the map and helping PWOQLA to find the optimal path. Comparing the performance of PWOQLA across all of these experiments, Experiment 6 is especially notable. In navigating concave obstacles, PWOQLA takes the shortest time in five simulations to find the target, with an average of only 0.85 s and a standard deviation of 0.05 s, both of which are the lowest. The results of Experiment 6 show that PWOQLA performs best in computational time, indicating that it is the most suitable for the path planning of mobile robots in concave obstacle maps similar to this experiment, such as maps with many curves or narrow tunnels. In Experiment 4, on the other hand, PWOQLA takes its longest time across the experiments, with an average computational time of 0.96 s. This shows that PWOQLA is not the best performer when obstacles are regular and repeated. Thus, the exploitation capability of PWOQLA should be further improved.
Although the exploitation capability of PWOQLA is insufficient, the experimental results show that PWOQLA still meets the requirements of reducing path planning time and finding the best path with fewer rotation angles. Moreover, PWOQLA overcomes the slow convergence of the original Q-learning.

Conclusions
The convergence speed of Q-learning is slow because its Q-table initialization is too simple and it wastes too much time in the exploration process. To solve these problems, we propose PWOQLA. Firstly, the WOQLA proposal solves the problem of slow convergence caused by the simple initialization of the Q-table. Through this innovation, in which the original WOA is used to initialize the values of the Q-table, a Q-table containing previous experience is obtained before the exploration process. Thus, the convergence of ε-greedy Q-learning is accelerated. Secondly, the PWOA proposal accelerates the whale population's approach to the local optimal solution, addressing the slow convergence of the original WOA. Thus, the efficiency of Q-learning initialization in WOQLA can be improved by replacing WOA with PWOA. Thirdly, the SES proposal, which utilizes the positional relationship between the current agent and the target, reduces the useless exploration of ε-greedy Q-learning and further improves the convergence speed. Fourthly, the proposal of a dynamically changing nonlinear function for ε overcomes the shortcoming that exploration and exploitation cannot be switched flexibly in the original ε-greedy Q-learning. Experimental results show that PWOQLA has greater accuracy and faster convergence than algorithms with similar functions.
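The third and fourth points can be illustrated with a short sketch of ε-greedy action selection in which ε depends on the iteration count and random exploration is restricted using the target position. This is not the formulation used in PWOQLA, which this section does not restate. The exponential schedule, its parameters (`eps_start`, `eps_end`, `rate`), the Manhattan-distance filter, and all function names are assumptions introduced for illustration, and obstacles are ignored.

```python
import math
import random
from typing import Dict, List, Tuple

State = Tuple[int, int]  # (row, column) of the agent on the grid
ACTIONS: List[Tuple[int, int]] = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def epsilon_at(iteration: int, max_iterations: int,
               eps_start: float = 0.9, eps_end: float = 0.05,
               rate: float = 5.0) -> float:
    """Hypothetical nonlinear schedule: exponential decay towards eps_end."""
    frac = iteration / max(max_iterations, 1)
    return eps_end + (eps_start - eps_end) * math.exp(-rate * frac)


def manhattan(a: State, b: State) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_action(q: Dict[State, List[float]], state: State, target: State,
                  eps: float, rng: random.Random) -> int:
    """Return an index into ACTIONS using a target-aware epsilon-greedy rule."""
    if rng.random() < eps:
        # Assumed selective exploration: only sample moves that do not
        # increase the distance to the target (obstacles are ignored here).
        d0 = manhattan(state, target)
        candidates = [i for i, (dr, dc) in enumerate(ACTIONS)
                      if manhattan((state[0] + dr, state[1] + dc), target) <= d0]
        return rng.choice(candidates or list(range(len(ACTIONS))))
    # Exploitation: greedy action with respect to the current Q-table row.
    values = q.get(state, [0.0] * len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda i: values[i])


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0, 50, 100):
        print(t, round(epsilon_at(t, 100), 3))
    print(choose_action({}, (0, 0), (3, 3), 1.0, rng))
```

In this sketch ε starts high, so early iterations mostly explore, and decays towards `eps_end`, so later iterations mostly exploit the learned Q-values, matching the qualitative behavior described above.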
Although PWOQLA balances exploration and exploitation capability in a static environment, its exploitation capability is insufficient in a dynamic environment. Thus, PWOQLA could be combined with other algorithms that have strong exploitation capabilities when applied to path planning in dynamic or extreme environments. The method for determining the parameters of the dynamic ε curve in PWOQLA can also be further improved. In future work, we will apply PWOQLA to mobile robot path planning in dynamic or extreme environments and test its performance.